An approach based on the paradigm of self-organized criticality is proposed for the experimental
investigation and theoretical modelling of software evolution. The dynamics of modifications is
studied for three free, open source programs, Mozilla, Free-BSD and Emacs, using data from version
control systems. Scaling laws typical of self-organized criticality are found. A model of software
evolution embodying the natural selection principle is proposed. The results of numerical and
analytical investigation of the model are presented. They are in good agreement with the data
collected for the real-world software.



The basic self-organization mechanisms of complex systems in nature have been intensively
studied in recent years. The paradigm of self-organized criticality (SOC), proposed in the
pioneering paper of P. Bak, C. Tang and K. Wiesenfeld [1], has proved to be the most fruitful
here. SOC dynamics is characterized by avalanche-like changes of the system state with power-law
statistics of the avalanche growth. The main feature of the SOC regime is that it is an
attractor of the system dynamics, approached without any fine tuning of control parameters.

Studies of the fossil record have shown that biological evolution is a strongly non-equilibrium
process, with long periods of stasis interrupted by avalanches of large changes in the
biosphere. This is the main point of the punctuated equilibrium concept of biological evolution
suggested by Gould and Eldredge [?, ?]. Detailed quantitative analysis of paleontological data
revealed scaling power laws in the distributions of avalanches of extinction and creation of
species [?]. Therefore biological evolution can be considered as a kind of SOC dynamics. This
has been demonstrated by P. Bak and K. Sneppen in their model of Darwinian selection in an
ecosystem [?]. The development of computer science and engineering has created a "virtual
biosphere" with specific evolution laws for its "virtual species", computer programs. In this
paper we propose an approach to the study of software evolution in the framework of the SOC
concept.

The "life" of a large computer program is a perfect example of an evolutionary process in a
complex system. During its creation a program often undergoes multiple internal
reorganizations. New devices and platforms are supported, new features are added, system tuning
is performed, erroneous code is corrected, and a huge number of cosmetic changes goes on during
the development of any program [?]. Despite the fact that the first papers on software
evolution are now decades old, the universal mechanisms of computer program evolution remain
unclear. Most of the existing research methods in this field are based on the assumption that
estimates of possible changes in a program can be obtained without taking into account the
underlying dynamical laws that create the system [?, ?]. In a multitude of papers the authors
propose statistical methods for predicting the number of defects in a program using some kind
of metrics describing complexity, size, volume, etc. [11, 12].

From our point of view, the main disadvantage of such an approach is that even the best static
metric, which forecasts the number of improvements needed for a program to conform to a given
specification, becomes useless if the specification changes substantially over time. Our
approach can be considered as the elaboration of a prototype for dynamical metrics based on the
characteristics of the SOC universality class of the system.

A lot of phenomenological work has been done on software evolution. Lehman's laws suggest that
as a system grows in size, it becomes increasingly difficult to add new code unless explicit
steps are taken to reorganize the overall design [13, 14]. Some systems have been examined both
at the system level and within the top-level subsystems. It has been noted that subsystems can
behave quite differently from the system as a whole [?, ?]. Good metaphors such as "code decay"
have been proposed to describe the continuous process that makes software more brittle over
time [?, ?]. Thus software evolution has many features in common with the evolution of
biological species, and one can expect that the evolution of a large computer program belongs
to some universality class of SOC dynamics.

To study software evolution processes it is necessary to have information about the state of
the system at different moments of time. The usual sources of such data are the various
versions or releases of a product [?, ?]. Unfortunately, the number of releases rarely exceeds
a couple of tens, which significantly limits the possibility of studying the evolution of a
program. Better sources of information about changes in computer programs are version control
systems. One of them is the Concurrent Versions System (CVS). It keeps information about the
changes made over short time intervals.

Using CVS in our work, we studied the histories of three software projects: the Mozilla web
browser, the Free-BSD operating system and the GNU Emacs text editor [?, ?, ?]. For each of
these projects we analyzed only files written in the main language of the project: C++ for
Mozilla, C for Free-BSD and Lisp for Emacs. Header files for C/C++ were not studied. The total
numbers of processed files are approximately 9000, 11000 and 900 for Mozilla, Free-BSD and
Emacs, respectively. The total lengths of the RCS files are $1\cdot10^{7}$, $1\cdot10^{7}$ and
$2\cdot10^{6}$ lines. The total amount of data processed exceeds 2 gigabytes. Due to resource
limitations only part of the Free-BSD CVS storage was processed. The histories of all three
projects are stored under the control of CVS and were publicly available from the
corresponding Internet servers during our research period.

FIG. 1: Distribution P(A) for Free-BSD.

FIG. 2: Distribution P(D) for Free-BSD.
For each change of each file, the number D of deleted lines and the number A of added lines
were collected. Empty lines and comments were counted together with the rest of the data. The
number of lines in the very first version of each file was not counted. The distributions P(A)
and P(D) were evaluated for the two resulting arrays $A_i$ and $D_i$. As an example, the data
for Free-BSD are shown in FIG. 1 and FIG. 2 on a log-log scale. The results for Emacs and
Mozilla are similar. One can see that power functions are accurate approximations for P(A) and
P(D): $P(A) \sim A^{\mu_a}$, $P(D) \sim D^{\mu_d}$. The values of the exponents are the
following:

Free-BSD: $\mu_a = -1.44 \pm 0.02$, $\mu_d = -1.48 \pm 0.02$
Mozilla: $\mu_a = -1.43 \pm 0.02$, $\mu_d = -1.47 \pm 0.02$
Emacs: $\mu_a = -1.39 \pm 0.03$, $\mu_d = -1.49 \pm 0.04$

These scaling laws can be considered as a manifestation of SOC in the evolution of software.
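
As an illustration of how an exponent of this kind can be estimated from an array of change
sizes, the following minimal Python sketch uses the maximum-likelihood (Hill) estimator for a
continuous power law. This is a swapped-in technique chosen for brevity: the text does not say
how the fits to the log-log plots were made, and the function names, the cutoff `x_min` and the
synthetic test data are assumptions of the example.

```python
import math
import random


def powerlaw_exponent_mle(samples, x_min=1.0):
    """Estimate mu in P(x) ~ x**mu (mu < -1) for x >= x_min.

    Returns (mu, standard_error) from the continuous maximum-likelihood
    estimator; samples below x_min are ignored.
    """
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum == 0.0:
        raise ValueError("need samples strictly above x_min")
    slope = 1.0 + len(tail) / log_sum           # a > 1 in P(x) ~ x**(-a)
    error = (slope - 1.0) / math.sqrt(len(tail))
    return -slope, error


def sample_powerlaw(n, mu, x_min=1.0, seed=0):
    """Draw n synthetic values with density proportional to x**mu."""
    rng = random.Random(seed)
    a = -mu
    # Inverse-transform sampling; 1 - random() lies in (0, 1].
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (a - 1.0))
            for _ in range(n)]


if __name__ == "__main__":
    data = sample_powerlaw(20000, mu=-1.45, seed=1)
    mu, err = powerlaw_exponent_mle(data)
    print(f"estimated mu = {mu:.3f} +/- {err:.3f}")
```

For integer line counts such as $A_i$ and $D_i$ the continuous estimator is only an
approximation at small sizes, so a cutoff `x_min` larger than 1 may be preferable there.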

One of the important notions used in the description of SOC dynamics is the avalanche. An SOC
process can be represented as a sequence of metastable states interrupted by avalanche-like
changes in the system. For the evolution of a computer program, the close analog of an
avalanche is the set of changes made from version to version. We see that the avalanche
statistics in the evolution of software is described by power functions with nontrivial
exponents. The universality of SOC dynamical mechanisms allows one to hope that a simple
"holistic" model can be constructed for its quantitative description [?]. To realize this idea
for software evolution modelling we use the following assumptions. A specific feature of
software changes is that one programmer cannot modify a program at different points
simultaneously (at least when using traditional development tools). The point of change is
characterized as the "weakest" one in the program text: a programmer has some subjective
estimate of the parts of a program and makes changes in the place judged to be the least
satisfactory. If a change is made at some point, corresponding changes must be made in some
other places, i.e. the program has a coordination structure of its elements. We suppose that
changes in the program cannot make its size smaller than some minimal one.

We formulate the model embodying this concept as follows. A computer program as a system
consists of a sequence of elements, lines of code. At time point $t$ the $i$-th line is
characterized by a number $b_i(t)$, $0 < b_i(t) < 1$, representing its "fitness" in the program
text, or a barrier with respect to change in future stages of evolution. The state of the
system of $N$ elements is fully given by the set of barriers
$B(t) = \{b_i(t),\ i = 1, 2, \ldots, N\}$. The evolution of the program is described in our
model as a sequence of $B(t)$ for discrete time points $t = 0, 1, 2, \ldots$. The coordination
structure of the program is represented by a network of its elements, in which each element
(node) is coordinated with its nearest neighbors. The node with the minimal barrier is defined
as the weakest unit of the system. At each time point $t$ we define the set $W(t)$ containing
the weakest unit together with all its neighbors. We call $W(t)$ the weakest spoil at time $t$.

The dynamics of the model is defined in the following way. The initial number of nodes $N(0)$
and the minimal possible number of nodes $K$ are supposed to be given. The initial values of
the barriers $b_i(0)$ are chosen at random. The state $B(t)$ at time point $t$ transforms into
the state $B(t+1)$ as follows. If the number of nodes $N(t)$ in the system is greater than $K$,
two kinds of changes are possible: with probability $\alpha$ a new neighbor node of the weakest
unit is inserted into the system, or with probability $1 - \alpha$ the weakest unit is deleted
from the system. After that, the barriers of all nodes of the weakest spoil $W(t)$ are set at
random. So if $N(t) > K$, the size of the system decreases or increases by one in one time
step. If $N(t) = K$, deletion is impossible and the insertion described above is made.

Our model is a modification of the well-known simple model of biological evolution suggested
by Bak and Sneppen [?]; its essential specific feature is that the number of system elements
varies in time. In our study we have considered two versions of the model: with
one-dimensional (1D) and with random neighbor (RN) coordination structure. In the 1D case the
nodes are organized into a 1D lattice with periodic boundary conditions, and each node has two
neighbors. In the RN model there is no fixed coordination structure in the system, and at each
time step $k$ random nodes are chosen as neighbors of the weakest unit. We have considered the
case $k = 1$ only.
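
The update rule of the RN version with $k = 1$ can be sketched in Python as follows. The sketch
rests on several assumptions that should be read as such: `alpha` is taken to be the insertion
probability (the convention of the master equation and of the asymptotics for the mean size
given later in the text); on insertion the weakest unit, its random neighbor and the new node
all receive new random barriers; on deletion only the neighbor does; and the minimal size
`k_min` is at least 2 so that a neighbor always exists. The function names and the list-based
storage are illustrative and are not the authors' implementation.

```python
import random


def rn_step(barriers, alpha, k_min, rng):
    """Apply one time step of the random-neighbor model in place.

    Returns +1 if a node was inserted and -1 if the weakest unit was deleted.
    """
    n = len(barriers)
    weakest = min(range(n), key=barriers.__getitem__)
    # Choose one random neighbor (k = 1) among the other n - 1 nodes.
    neighbor = rng.randrange(n - 1)
    if neighbor >= weakest:
        neighbor += 1
    # At the minimal size deletion is impossible, so insertion is forced.
    insert = n <= k_min or rng.random() < alpha
    barriers[neighbor] = rng.random()
    if insert:
        barriers[weakest] = rng.random()
        barriers.append(rng.random())      # the newly inserted node
        return 1
    # Deletion: node order is irrelevant in the RN model, so swap-remove.
    barriers[weakest] = barriers[-1]
    barriers.pop()
    return -1


def simulate(n0, k_min, alpha, steps, seed=0):
    """Run the model and return the system size after each step."""
    rng = random.Random(seed)
    barriers = [rng.random() for _ in range(n0)]
    sizes = []
    for _ in range(steps):
        rn_step(barriers, alpha, k_min, rng)
        sizes.append(len(barriers))
    return sizes


if __name__ == "__main__":
    sizes = simulate(n0=50, k_min=10, alpha=0.5, steps=2000, seed=1)
    print("final size:", sizes[-1], "smallest size seen:", min(sizes))
```

The linear search for the weakest unit keeps the sketch short; a heap or a sorted structure
would be the natural replacement for systems of several thousand nodes.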

An avalanche, as the elementary process of the complex behavior of a non-equilibrium dynamical
system, can be defined in different ways. Usually in models of SOC dynamics the
$\lambda$-avalanches and transient avalanches are considered [?]. In the study of our model we
were interested mostly in transient avalanches. They can be defined as follows. Let the
minimal barrier have the value $f_0$ at the time moment $t_0$. The sequence of $S$ time steps
during which the minimal barrier does not exceed $f_0$, $b_{\min}(t) \le f_0$ for
$t_0 < t < t_0 + S$, is called a transient avalanche, or just an avalanche, if it finishes at
the time point $t_0 + S$, when the value of the minimal barrier becomes larger than $f_0$:
$b_{\min}(t_0 + S) > f_0$. The distribution $P(S)$ of the avalanche temporal duration and the
distribution $P(R)$ of the avalanche spatial volume are important characteristics of the type
of dynamics. For our model it is reasonable to consider two quantities as characteristics of
the volume of changes produced in the system by an avalanche. One of them is the number $A$ of
new elements that have appeared in the system by the end of the avalanche. The other is the
number $D$ of elements that have disappeared from the system by the end of the avalanche. In
the dynamics of our model we studied mostly the distributions $P(S)$, $P(A)$, $P(D)$ of the
temporal and spatial characteristics of avalanches.
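
The definition of a transient avalanche translates directly into a scan over the time series of
the minimal barrier: an avalanche ends, and the next one starts, each time the minimal barrier
exceeds the value it had when the current avalanche began. The Python sketch below illustrates
this. It assumes that $A$ and $D$ are the numbers of insertions and deletions made during the
avalanche (one possible reading of the definition above), and the function name and the input
layout are hypothetical.

```python
def transient_avalanches(b_min, changes):
    """Split a run into transient avalanches.

    b_min[t]   -- minimal barrier at time t, for t = 0 .. T
    changes[t] -- size change (+1 or -1) made in the step t -> t + 1

    Returns a list of (S, A, D): duration, insertions and deletions
    of every completed avalanche.
    """
    avalanches = []
    t0, f0 = 0, b_min[0]
    for t in range(1, len(b_min)):
        if b_min[t] > f0:                  # avalanche finishes at t = t0 + S
            window = changes[t0:t]
            avalanches.append((t - t0, window.count(1), window.count(-1)))
            t0, f0 = t, b_min[t]           # the next avalanche starts here
    return avalanches


if __name__ == "__main__":
    b_min = [0.10, 0.05, 0.08, 0.20, 0.30, 0.10, 0.25, 0.40]
    changes = [1, -1, 1, 1, -1, -1, 1]
    print(transient_avalanches(b_min, changes))
```

Because the threshold $f_0$ only ever increases from one avalanche to the next, these
avalanches belong to the self-organization period of a run rather than to its stationary state.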

We studied numerically the 1D and RN versions of
the model for $\alpha = 1/2$. The initial size of the system
was 8000 elements. The experiment went on until one million avalanches were registered. We
obtained the following results. The distributions $P(S)$, $P(A)$, $P(D)$ are well approximated
by the power functions $P(S) \sim S^{\tau}$, $P(A) \sim A^{\mu_a}$, $P(D) \sim D^{\mu_d}$ with
exponents $\tau = -1.358 \pm 0.005$, $\mu_a = -1.45 \pm 0.01$, $\mu_d = -1.47 \pm 0.02$ for the
1D model and $\tau = -1.901 \pm 0.008$, $\mu_a = -1.98 \pm 0.01$, $\mu_d = -2.10 \pm 0.02$ for
the RN model.

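For the RN model it is possible to obtain an analytical description in the framework of the
master equation formalism. To do this one can use the method of constructing the master
equation proposed for the analysis of the SOC dynamics of the random neighbor version of the
Bak-Sneppen model [22]. If we denote by $P_{n,N}(t)$ the probability that at time point $t$
there are $N$ nodes in the system and $n$ of them have barriers less than $\lambda$, where
$0 < \lambda < 1$, then the dynamical rules of the RN model result in the following master
equation:

$$P_{n,N}(t+1) = (\alpha + \beta\,\delta_{N,K+1})\,P^{a}_{n,N}(t) + \beta\,P^{d}_{n,N}(t), \qquad (1)$$

where $\beta = 1 - \alpha$ and, in terms of $\mu = 1 - \lambda$, $p_{n,N} = (n-1)/(N-1)$,
$\sigma_{n,N} = 1 - p_{n,N}$, the quantities $P^{a}_{n,N}(t)$ and $P^{d}_{n,N}(t)$ are given by
the following relations:

$$\begin{aligned}
P^{a}_{n,N}(t) ={}& A^{a}_{n+2,N-1}P_{n+2,N-1}(t) + B^{a}_{n+1,N-1}P_{n+1,N-1}(t) + C^{a}_{n,N-1}P_{n,N-1}(t) + D^{a}_{n-1,N-1}P_{n-1,N-1}(t) \\
 &+ E^{a}_{n-2,N-1}P_{n-2,N-1}(t) + (\mu^{3}\delta_{n,0} + 3\lambda\mu^{2}\delta_{n,1} + 3\lambda^{2}\mu\,\delta_{n,2} + \lambda^{3}\delta_{n,3})\,P_{0,N-1}(t), \\
P^{d}_{n,N}(t) ={}& A^{d}_{n+2,N+1}P_{n+2,N+1}(t) + B^{d}_{n+1,N+1}P_{n+1,N+1}(t) + C^{d}_{n,N+1}P_{n,N+1}(t) + (\mu\,\delta_{n,0} + \lambda\,\delta_{n,1})\,P_{0,N+1}(t), \\
A^{a}_{n,N} ={}& \mu^{3}p_{n,N}, \quad B^{a}_{n,N} = 3\lambda\mu^{2}p_{n,N} + \mu^{3}\sigma_{n,N}, \quad C^{a}_{n,N} = 3\lambda\mu^{2}\sigma_{n,N} + 3\lambda^{2}\mu\,p_{n,N}, \quad D^{a}_{n,N} = 3\lambda^{2}\mu\,\sigma_{n,N} + \lambda^{3}p_{n,N}, \\
E^{a}_{n,N} ={}& \lambda^{3}\sigma_{n,N}, \quad A^{d}_{n,N} = \mu\,p_{n,N}, \quad B^{d}_{n,N} = \lambda\,p_{n,N} + \mu\,\sigma_{n,N}, \quad C^{d}_{n,N} = \lambda\,\sigma_{n,N}.
\end{aligned}$$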
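
A quick consistency check on the transition coefficients is that, for a state with $n \ge 1$
nodes below $\lambda$, the probabilities of all possible outcomes of one insertion (or of one
deletion) add up to one. The Python sketch below verifies this with exact rational arithmetic.
The coefficients $A^{a}$ to $D^{a}$ follow the text; $E^{a}$ and the deletion coefficients
$A^{d}$, $B^{d}$, $C^{d}$ are written in the form obtained from the same counting argument
(three new random barriers on insertion, one on deletion) and should be treated as an
assumption of this example, as should the function names.

```python
from fractions import Fraction


def insertion_coefficients(n, N, lam):
    """Coefficients A^a .. E^a for a state with n of N barriers below lam."""
    lam = Fraction(lam)
    mu = 1 - lam
    p = Fraction(n - 1, N - 1)     # the random neighbor is below lam
    sigma = 1 - p                  # the random neighbor is above lam
    return {
        "A": mu**3 * p,
        "B": 3 * lam * mu**2 * p + mu**3 * sigma,
        "C": 3 * lam * mu**2 * sigma + 3 * lam**2 * mu * p,
        "D": 3 * lam**2 * mu * sigma + lam**3 * p,
        "E": lam**3 * sigma,       # assumed form, see the lead-in
    }


def deletion_coefficients(n, N, lam):
    """Coefficients A^d, B^d, C^d (assumed forms, see the lead-in)."""
    lam = Fraction(lam)
    mu = 1 - lam
    p = Fraction(n - 1, N - 1)
    sigma = 1 - p
    return {
        "A": mu * p,
        "B": lam * p + mu * sigma,
        "C": lam * sigma,
    }


if __name__ == "__main__":
    lam = Fraction(1, 3)
    print(sum(insertion_coefficients(4, 10, lam).values()))
    print(sum(deletion_coefficients(4, 10, lam).values()))
```

Both sums factor as $(p_{n,N} + \sigma_{n,N})(\lambda + \mu)^{m}$ with $m = 3$ and $m = 1$
respectively, which is why they equal one for any admissible $n$, $N$ and $\lambda$.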



Here the coefficients $A^{a}_{n,N}$, $B^{a}_{n,N}$, $C^{a}_{n,N}$, $D^{a}_{n,N}$,
$E^{a}_{n,N}$, $A^{d}_{n,N}$, $B^{d}_{n,N}$, $C^{d}_{n,N}$ are defined by the last relations
above for $0 < n \le N$. For $n \le 0$ and $n > N$ they are assumed to be zero. The master
equation (1) enables one to find $P_{n,N}(t)$ for $t > 0$ if the initial values $P_{n,N}(0)$
are given. On the basis of this equation one can obtain analytical results for the
characteristics of the dynamics in the RN model. To this end it is convenient to use the
formalism of generating functions, which has proved very effective for the construction of the
exact solution of the master equations of the RN version of the Bak-Sneppen model. The
dynamics of the RN model is more complex than that of the Bak-Sneppen model, and solving the
master equation of the RN model of software evolution is not an easy problem. Here we present
only the exact result for $P_N(t) = \sum_{n=0}^{N} P_{n,N}(t)$, the probability that the system
has $N$ elements at time point $t$, irrespective of the number $n$ of elements with barriers
less than $\lambda$. Let us denote by $\mathcal{N}(y,u)$ the generating function of the
probabilities $P_N(t)$:
$\mathcal{N}(y,u) = \sum_{N=K}^{\infty}\sum_{t=0}^{\infty} P_N(t)\,y^{N-K}u^{t}$. From (1) we
obtain the following equation for $\mathcal{N}(y,u)$:

$$\mathcal{N}(y,u)\,[y - u(\alpha y^{2} + \beta)] = y\,\mathcal{N}(y,0) + u\beta\,[y^{2} - 1]\,\mathcal{N}(0,u).$$
It describes one-dimensional discrete diffusion with reflection and can be solved by the
methods used in [23, 24, 25, 26]. The result has the form



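$$\mathcal{N}(y,u) = \frac{y\,\mathcal{N}(y,0)\,(u - r) + u\alpha r\,\mathcal{N}(r,0)\,(y^{2} - 1)}{(u - r)\,[y - u(\alpha y^{2} + \beta)]},$$

where $r = (1 - \sqrt{1 - 4u^{2}\alpha\beta})/(2u\alpha)$ is the solution of the equation
$r - u(\alpha r^{2} + \beta) = 0$ that is analytic at the point $u = 0$.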
The mean value $\bar{n}(t) = \sum_{N=K}^{\infty} N P_N(t)$ of the number of system elements at
time point $t$ has the following asymptotics for large $t$: $\bar{n}(t) \approx (2\alpha - 1)t$
for $\alpha > 1/2$, $\bar{n}(t) \approx \sqrt{2t/\pi}$ for $\alpha = 1/2$, and, if we denote by
$p_{ev}$ ($p_{od}$) the probability that the initial number $N(0)$ of nodes is even (odd), then

$$\bar{n}(t) \approx K + \frac{1 + (-1)^{t+K}(1 - 2\alpha)^{2}(p_{ev} - p_{od})}{2(1 - 2\alpha)}$$

for $\alpha < 1/2$. The corrections to the leading terms of the asymptotics are of the form
$\bar{n}(0)$ - ctz^2) $+ f(t)g_1(t)$ for $\alpha > 1/2$, $t^{-1/2}g_3(t)$ for $\alpha = 1/2$
and $f(t)g_2(t)$ for $\alpha < 1/2$. Here $f(t) = (4\alpha\beta)^{t/2}\,t^{-3/2}$ and the
$g_i(t)$, $i = 1, 2, 3$, are bounded for large $t$, i.e. there are constants $T$, $M$ such that
$|g_i(t)| < M$ if $t > T$. Since $4\alpha\beta < 1$ for $\alpha \ne 1/2$, the function $f(t)$
decreases exponentially fast for large $t$.
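
The leading asymptotics can be checked numerically without simulating barriers, because
summing the dynamics over $n$ leaves a simple recursion for the size alone: the size grows by
one with probability $\alpha$, shrinks by one with probability $\beta = 1 - \alpha$, and always
grows when $N = K$. The Python sketch below iterates this recursion for $m = N - K$. It assumes
a start at $N(0) = K + m_0$ with certainty, and the function names are ours. For
$\alpha < 1/2$ it looks only at the average over two consecutive time steps,
$1/[2(1 - 2\alpha)]$, which does not involve the parity-dependent term.

```python
import math


def size_distribution(alpha, t_max, m0=0):
    """Return the list P[m] = Prob(N - K = m) at time t_max, for N(0) = K + m0."""
    beta = 1.0 - alpha
    p = [0.0] * (m0 + t_max + 2)
    p[m0] = 1.0
    for _ in range(t_max):
        q = [0.0] * len(p)
        q[1] += p[0]                      # forced insertion at N = K
        for m in range(1, len(p) - 1):
            q[m + 1] += alpha * p[m]      # insertion
            q[m - 1] += beta * p[m]       # deletion
        p = q
    return p


def mean_excess(alpha, t_max, m0=0):
    """Mean of N(t) - K at time t_max."""
    return sum(m * pm for m, pm in enumerate(size_distribution(alpha, t_max, m0)))


if __name__ == "__main__":
    t = 600
    print("alpha = 0.75:", mean_excess(0.75, t) / t, "vs", 2 * 0.75 - 1)
    print("alpha = 0.50:", mean_excess(0.5, t), "vs", math.sqrt(2 * t / math.pi))
    two_step = 0.5 * (mean_excess(0.25, 400) + mean_excess(0.25, 401))
    print("alpha = 0.25:", two_step, "vs", 1 / (2 * (1 - 2 * 0.25)))
```

The three regimes (linear growth, square-root growth and a bounded size) show up directly in
the printed values.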

The asymptotic behavior of $\bar{n}(t)$ demonstrates a dynamical phase transition at the point
$\alpha = 1/2$. For $\alpha < 1/2$ the volume of the system remains finite, but for
$\alpha > 1/2$ it can become arbitrarily large. At the point $\alpha = 1/2$ the dynamics of the
system is critical.

In the model formulated above we tried to represent the elementary mechanisms of software
changes. They are made by a programmer locally, in the place where the changes are most needed.
But a program changed in one place often must be changed in other places that are in some way
connected to the first one. For example, in order to change the number of arguments of a
subroutine call, one needs to change not only the line containing the call but also the
definition of the subroutine. This leads to subsequent changes of all the calls to the
subroutine throughout the program. If one adds a line in which some data are read from a disk,
one should add some lines to check whether the data have been read successfully, and this in
turn can require a change in the list of included modules, which in turn can cause a name
conflict, which in turn can cause other changes, etc. Thus avalanche-like processes seem to be
natural for modifications of programs. An avalanche ends when all the parts of the program
code more or less satisfy some subjective and implicit criteria of the programmer. Naively
speaking, the program as a whole becomes "a little bit better". In the model this can be
represented as a process that terminates when the value of the minimal barrier becomes greater
than the initial one. This is why we studied the transient avalanches of the self-organization
period and not the $\lambda$-avalanches of the stationary mode. The statistical characteristics
of avalanches obtained make it possible to conclude that SOC is the dominating dynamical
regime in the evolution of free software. Our results demonstrate that natural selection can
create this type of "punctuated equilibrium" for such complex "virtual beings" in the
info-sphere. We believe that in the framework of the proposed approach the modern methods of
investigation of SOC dynamics can prove very effective for studies of the basic problems of
software evolution. Our results can also be seen as a theoretical prerequisite for the
development of new tools and methods for advanced measures in software quality engineering.